Int8 Quantization Support for DiT (Z-Image & Qwen-Image) #1470
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54149deb01
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
May I ask how you measured the memory usage? I didn't take the KV cache into account and only recorded the weights of the loaded model. There seem to be some discrepancies between the two.
I observed the peak VRAM during runtime by using |
lishunyang12
left a comment
left a couple comments, mostly around the top-level torch_npu import
from typing import TYPE_CHECKING, Any, Optional

import torch
import torch_npu
import torch_npu at module top level will crash on any non-NPU machine. Since __init__.py imports DiffusionInt8Config unconditionally, this breaks the entire vllm_omni.diffusion.quantization package — including FP8 codepaths.
Move this to a lazy import inside the methods that actually call torch_npu.* (e.g. apply(), process_weights_after_loading()).
Following your suggestion, I have switched to a lazy import.
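The lazy-import pattern requested in this thread can be sketched as follows. This is a minimal illustration, not the code from the PR: `NpuInt8LinearMethodSketch` and `_lazy_import` are hypothetical names, and the real `int8.py` may structure its methods differently. The point is only that the optional backend is imported when a method runs, so importing the package itself never fails on a machine without that backend.

```python
import importlib
from types import ModuleType


def _lazy_import(name: str) -> ModuleType:
    """Import an optional backend module only when a code path needs it."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(
            f"{name} is required for this code path but is not installed"
        ) from exc


class NpuInt8LinearMethodSketch:
    """Hypothetical stand-in for the int8 linear method (not the PR's class)."""

    def __init__(self, backend: str = "torch_npu") -> None:
        # Only the module name is stored here; nothing is imported yet,
        # so constructing the object is safe on non-NPU machines.
        self._backend = backend

    def process_weights_after_loading(self, layer: object) -> ModuleType:
        # The import happens on first use, not at module import time.
        return _lazy_import(self._backend)
```

With this shape, FP8-only users never trigger the `torch_npu` import, and NPU users get a clear error only if the backend is genuinely missing when the int8 path runs.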
replace_parameter(layer, "weight_scale", weight_scale)

class DiffusionInt8Config(DiffusionQuantizationConfig):
Missing quant_config_cls = Int8Config — without it get_name() from the base class raises NotImplementedError. See how DiffusionFp8Config sets quant_config_cls = Fp8Config.
Following your suggestion, I have added quant_config_cls = Int8Config.
Confirmed in the diff. Thanks.
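For readers unfamiliar with the pattern discussed in this thread, here is a rough sketch of how a class attribute such as `quant_config_cls` can drive `get_name()` in a base class. The classes below are simplified stand-ins with hypothetical names; the actual `DiffusionQuantizationConfig` in vLLM-Omni may implement the lookup differently. Only the behavior described in the review (a missing attribute leads to `NotImplementedError`) is taken from the source.

```python
from typing import Optional


class Int8ConfigStub:
    """Stand-in for the underlying Int8Config (assumed to expose get_name)."""

    @classmethod
    def get_name(cls) -> str:
        return "int8"


class DiffusionQuantizationConfigSketch:
    """Simplified base class: subclasses are expected to set quant_config_cls."""

    quant_config_cls: Optional[type] = None

    @classmethod
    def get_name(cls) -> str:
        if cls.quant_config_cls is None:
            # Mirrors the failure described in the review comment.
            raise NotImplementedError(
                f"{cls.__name__} must set quant_config_cls"
            )
        return cls.quant_config_cls.get_name()


class DiffusionInt8ConfigSketch(DiffusionQuantizationConfigSketch):
    # The one-line fix requested by the reviewer.
    quant_config_cls = Int8ConfigStub


class BrokenConfigSketch(DiffusionQuantizationConfigSketch):
    # Forgetting the attribute reproduces the reported error.
    pass
```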
logger = logging.getLogger(__name__)

CONDITION_IMAGE_SIZE = 384 * 384
The CONDITION_IMAGE_SIZE / VAE_IMAGE_SIZE refactor seems unrelated to int8 quantization. Worth splitting into its own PR to keep review scope tight.
This was caused by an incorrect commit and has now been removed from the branch.
Signed-off-by: juboyu <767868009@qq.com>
88f7f80 to a3fcc33
Can you post visual output for z-image?
I will upload the visual output for z-image later.
Signed-off-by: JuboYu <767868009@qq.com>
@codex review
Any test results on GPU?
## Device Compatibility for Int8

| NPU Generation | Int8 Mode |
The GPU should also support it, right?
This version only supports NPUs; the GPU version still needs to be developed. I don't think GPUs need INT8, since FP8 is a better option there, whereas NPUs currently support only INT8.
Maybe you should add a TODO note about int8 support on GPU.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a117002e76
def from_config(cls, config: dict[str, Any]) -> "Int8Config":
    quant_method = cls.get_from_keys(config, ["quant_method"])
    is_checkpoint_int8_serialized = "int8" in quant_method
    activation_scheme = cls.get_from_keys(config, ["activation_scheme"])
Fall back to dynamic activation scheme in from_config
Int8Config.from_config treats activation_scheme as mandatory via get_from_keys, so any int8 quantization config that omits this field will raise during config parsing instead of using the class default. This breaks loading paths that rely on minimal quantization metadata (e.g., only quant_method present) even though __init__ already defines "dynamic" as the default scheme; switching to an optional lookup with a default avoids this hard failure.
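The optional lookup suggested here might look roughly like the sketch below. The helpers are self-contained stand-ins written for illustration; `parse_int8_config` is a hypothetical function, not the PR's `from_config`. vLLM's `QuantizationConfig` base class appears to provide a comparable `get_from_keys_or` helper, but that should be checked against the installed version before relying on it.

```python
from typing import Any


def get_from_keys(config: dict[str, Any], keys: list[str]) -> Any:
    """Return the first matching key's value; raise if none is present."""
    for key in keys:
        if key in config:
            return config[key]
    raise ValueError(f"Cannot find any of {keys} in the quantization config")


def get_from_keys_or(config: dict[str, Any], keys: list[str], default: Any) -> Any:
    """Optional lookup: fall back to a default instead of raising."""
    try:
        return get_from_keys(config, keys)
    except ValueError:
        return default


def parse_int8_config(config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical parser: quant_method is mandatory, activation_scheme is not."""
    quant_method = get_from_keys(config, ["quant_method"])
    return {
        "is_checkpoint_int8_serialized": "int8" in quant_method,
        # "dynamic" matches the default the review says __init__ already uses.
        "activation_scheme": get_from_keys_or(
            config, ["activation_scheme"], "dynamic"
        ),
    }
```

A config that carries only `quant_method` then parses cleanly, while an explicit `activation_scheme` still takes precedence over the default.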
…he prefix for quantization ignored_layers Signed-off-by: juboyu <767868009@qq.com>
Signed-off-by: JuboYu <767868009@qq.com>
Please fix pre-commit
@lishunyang12 @david6666666 @hsliuustc0106 This PR has been prepared and is ready for your review.
…t#1470) Signed-off-by: juboyu <767868009@qq.com> Signed-off-by: JuboYu <767868009@qq.com> Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
…t#1470) Signed-off-by: juboyu <767868009@qq.com> Signed-off-by: JuboYu <767868009@qq.com> Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>
Per PR vllm-project#1470's review template, the "Quantization Showcase" ("量化展示") section requires two tables: the Summary table (already emitted) and a Memory Profiling table that breaks Peak into Weights + Activations.

Capture the extra weights/activations numbers in `_generate_image` and `_generate_video` by snapshotting `memory_allocated()` right before each `generate()` call (= weights + persistent buffers already on device) and subtracting it from the post-generate `max_memory_allocated()` to get the activations delta.

Surface the values in `run_benchmark` as a third markdown table ("### Memory Profiling") with the columns PR vllm-project#1470 used: Weights / Activations / Peak / Total Reduction, broken down by TP size (from `args.tensor_parallel_size`). First-prompt snapshot is canonical, matching the existing Peak column's "use first prompt's memory" convention.

Signed-off-by: ultranationalism <www913363043@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultism <www913363043@gmail.com>
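The snapshot-and-subtract measurement described in this commit message can be sketched as below. This is an illustration under assumptions, not the benchmark's code: `profile_generate`, `MemoryBreakdown`, and `FakeDevice` are hypothetical names, and resetting the peak counter before the snapshot is an added assumption that the commit message does not mention. On a CUDA device the three callables would typically be `torch.cuda.memory_allocated`, `torch.cuda.max_memory_allocated`, and `torch.cuda.reset_peak_memory_stats`; a fake device is used here so the sketch runs anywhere.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryBreakdown:
    weights: int       # bytes resident before generate() (weights + buffers)
    activations: int   # extra bytes needed at the peak of generate()
    peak: int          # peak bytes observed during generate()


def profile_generate(
    generate: Callable[[], object],
    memory_allocated: Callable[[], int],
    max_memory_allocated: Callable[[], int],
    reset_peak: Callable[[], None],
) -> MemoryBreakdown:
    reset_peak()                      # assumption: start from a clean peak counter
    weights = memory_allocated()      # snapshot right before generate()
    generate()
    peak = max_memory_allocated()     # post-generate peak
    return MemoryBreakdown(weights=weights, activations=peak - weights, peak=peak)


class FakeDevice:
    """Tiny allocator model used only to exercise the sketch without a GPU."""

    def __init__(self, resident: int) -> None:
        self.current = resident
        self.peak = resident

    def alloc(self, n: int) -> None:
        self.current += n
        self.peak = max(self.peak, self.current)

    def free(self, n: int) -> None:
        self.current -= n

    def memory_allocated(self) -> int:
        return self.current

    def max_memory_allocated(self) -> int:
        return self.peak

    def reset_peak(self) -> None:
        self.peak = self.current


device = FakeDevice(resident=1000)


def fake_generate() -> None:
    device.alloc(300)   # temporary activation memory
    device.free(300)


result = profile_generate(
    fake_generate,
    device.memory_allocated,
    device.max_memory_allocated,
    device.reset_peak,
)
```

Passing the memory functions in as callables keeps the measurement logic independent of the device backend, which may matter for a benchmark that targets both GPU and NPU.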
…ponent glue

Pivot from the closed PR vllm-project#1986 design (runtime nunchaku-format glue) to the post-RFC architecture where:
- vLLM upstream (vllm-project/vllm) hosts the SVDQuant quantization config, linear method, dispatcher, and native SM_100/103 CuTe DSL kernel.
- vllm-omni hosts only diffusion-specific glue and a one-time offline converter that emits canonical row-major NVFP4 checkpoints.

Components:
- `vllm_omni/quantization/tools/convert_nunchaku_to_svdquant.py`: Standalone converter that ingests nunchaku-published merged safetensors and emits a vLLM-loadable diffusers pipeline tree in canonical row-major + FP4 nibble pack. This is the layout the SM_100 native CuTe kernel consumes directly; for the nunchaku backend on consumer GPUs (SM_75-89 / SM_120), vLLM repacks to the PTX-MMA tile layout at load time in `SVDQuantLinearMethod.process_weights_after_loading`. The on-disk format is backend-agnostic.
- `vllm_omni/quantization/tools/svdquant_nvfp4_layout.py`: Thin re-export shim. The actual layout adapters live in `vllm/model_executor/layers/quantization/utils/svdquant_nvfp4_layout.py` in vLLM proper; this file keeps the import surface stable for downstream code that referenced the original vllm-omni location.
- `vllm_omni/quantization/component_config.py`: per-component quantization config wiring so per-pipeline-component (transformer, text_encoder, vae, etc.) quant config can be declared declaratively in `transformer/config.json["quantization_config"]` rather than runtime monkey-patching. Addresses the "blanket strict-validation disable" review concern from vllm-project#1986.
- `vllm_omni/diffusion/models/z_image/z_image_transformer.py`: trailing-dot fix in `stacked_params_mapping` (replaces the per-model diffusers->vLLM key remapping from the closed PR; reduced to one line under the canonical row-major design).
- `examples/offline_inference/text_to_image/text_to_image.py`: smarter quantization label resolution that mirrors `OmniDiffusionConfig._propagate_quantization_from_tf_config` so the startup banner reflects on-disk per-component quant config rather than printing "None (BF16)" for an already-quantized checkpoint.

Canonical checkpoint produced by this converter:
- HuggingFace: https://huggingface.co/ultranationalism/nunchaku-z-image-turbo-svdq
- ModelScope: https://www.modelscope.cn/models/ultranationalism/Z-Image-Turbo-SVDQuant-NVFP4

Test plan (pending validation on consumer Blackwell SM_120):
- E2E quantized Z-Image-Turbo inference on RTX 5090
- BF16 vs SVDQuant LPIPS quality benchmark per PR vllm-project#1470 template

Refs: vllm-project#1986 (closed), RFC vllm-project/vllm#37908

AI assistance: this commit was produced with Claude Code assistance.

Signed-off-by: ultranationalism <www913363043@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultism <www913363043@gmail.com>
Overview
This PR introduces online int8 quantization for DiT models in vLLM-Omni. The supported model range follows that of FP8 quantization, starting with Z-Image and Qwen-Image (text-to-image). Online int8 converts BF16/FP16 weights to int8 at model load time and applies dynamic activation scaling at inference time.
Device compatibility: W8A8
Per-layer control: ignored_layers lets users keep sensitive layers in BF16
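To make "online weight conversion with dynamic activation scaling" concrete, here is a small pure-Python sketch of a W8A8 linear layer. It assumes symmetric quantization with one scale per weight output channel and one scale per activation row (token); the PR does not state its scale granularity in this description, and the real implementation runs on NPU kernels rather than Python lists. `quantize_rows` and `int8_linear` are hypothetical names.

```python
def quantize_rows(matrix):
    """Symmetric per-row int8 quantization: returns (int values, one scale per row)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_linear(x, q_weight, w_scales):
    """Compute y = x @ W^T from int8 weights and dynamically quantized activations."""
    # Dynamic activation scaling: scales are computed from x at call time.
    q_x, x_scales = quantize_rows(x)
    out = []
    for q_tok, s_x in zip(q_x, x_scales):
        out.append([
            # Integer dot product, then rescale back to floating point.
            sum(a * b for a, b in zip(q_tok, q_row)) * s_x * s_w
            for q_row, s_w in zip(q_weight, w_scales)
        ])
    return out


# "Online" conversion: the float weights are quantized once, at load time.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]   # shape (out=2, in=3)
q_w, w_s = quantize_rows(weight)

x = [[1.0, 2.0, -1.0]]                            # one token, three features
y = int8_linear(x, q_w, w_s)                      # close to the float result [-1.75, 2.5]
```

Layers listed in `ignored_layers` would simply skip the `quantize_rows` step and keep their BF16 weights, trading memory savings for accuracy on sensitive layers.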
Supported Models
ignored_layers
Tongyi-MAI/Z-Image-Turbo
Qwen/Qwen-Image, Qwen/Qwen-Image-2512
Changes
Quantization framework (vllm_omni/diffusion/quantization/)
- int8.py: DiffusionInt8Config with Int8Config for device (dynamic activation scaling, online weight conversion)
- __init__.py: DiffusionInt8Config is added and registered to the supported quantization methods
Tests (tests/diffusion/quantization/)
- test_int8_config.py: Unit tests covering config creation, vLLM config extraction, ignored_layers, dict non-mutation, conflicting method warnings, and end-to-end integration with OmniDiffusionConfig
Documentation (docs/)
- user_guide/diffusion/quantization/overview.md: Quantization methods overview
- user_guide/diffusion/quantization/int8.md: Usage guide (Python API + CLI), parameter reference, per-model recommendations
- user_guide/diffusion_acceleration.md: Updated model support table with int8 column
- .nav.yml: Added int8 quantization section to docs navigation
Example (examples/offline_inference/text_to_image/text_to_image.py)
How to Use
Test Plan
Test Result
Quantization Quality Benchmark for GPU
Quantization Quality Benchmark for Atlas A2
Memory Profiling
Qwen-Image
Z-Image
Related Issues